Amiga SED An Amiga Stream Editor © THOR-Software (Thomas Richter)
______________________________________________________________________________
Purpose of this program:
SED takes an input file, checks each line of this file against a
pattern supplied on the command line, and generates a new line from
this pattern match in the destination. This could either mean that
the matching line is removed completely from the output, replaced
by a different line, or changed according to the specifications of
SED.
SED is an approximation of the Unix "stream editor" sed. It is not
quite as powerful as sed because its command set is currently very
limited, and it does not support command files. Its pattern syntax
is different, too. It still looks like "line noise" to me - I didn't
want to break with this tradition - but it's at least the Amiga kind
of line noise.
Its pattern matching rules are a superset of the AmigaOs patterns,
with some additional features like "captured expressions" and more
powerful "character classes" and "escaping".
SED is useful for automatic processing of text files, e.g. the modi-
fication of the startup-sequence. SED can also be run as a "filter"
in which case it reads its input from stdin and prints output to
stdout. Combine this feature with pipes and you get a very powerful
text processing tool.
A warning: Pattern matching looks simple, but is full of hard-to-grasp
traps. This tool is therefore intended for "expert usage". In
case you think SED doesn't process your pattern correctly, think
twice!
______________________________________________________________________________
SYNOPSIS:
SED FROM,TO,MATCH/A,REPLACE,CHANGE,DELETE/S,USECASE/S,ALL/S,VERBOSE/S
FROM An (AmigaOs) pattern specifying the input file(s) to process.
If not given, SED reads from the standard input.
TO The output file to write the processed lines to. If not
given, SED writes to the standard output.
MATCH A pattern specification used for filtering the input lines.
More on the pattern rules below.
The next options specify what to do with the lines matching the pattern:
REPLACE Replace matching lines in the input by this replace rule, do
not write non-matching lines to the output. The replacement
rules are given below.
CHANGE Replace matching lines by the replace rule given by this
expression. In contrast to REPLACE, non-matching lines are placed
in the destination without change. Useful for modifying a file
according to a pattern rule.
DELETE Print all lines except those matching the pattern. This
effectively removes the matching lines from the input file.
USECASE Be case-sensitive. By default, SED is case-insensitive.
Note that SED differs in this detail from the Un*x sed.
ALL In case the FROM pattern is a wildcard, enter sub-directories
recursively.
VERBOSE Print information about the file currently scanned, and upon
entering a directory. Also prints file name and line number
information for found matches. By default, SED is quiet.
Note that "Search" is by default not quiet.
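The three actions differ only in what happens to matching and non-matching
lines. The following Python sketch is not how SED is implemented; it merely
models the behaviour described above. The names process, matches and rewrite
are made up for this illustration, and the two lambdas only stand in for a
real MATCH pattern and replace rule.

```python
def process(lines, matches, action, rewrite=None):
    """Model of the REPLACE, CHANGE and DELETE actions (illustration only).

    matches -- hypothetical predicate standing in for the MATCH pattern
    rewrite -- hypothetical function standing in for the replace rule
    """
    out = []
    for line in lines:
        hit = matches(line)
        if action == "REPLACE":
            # Matching lines are rewritten, non-matching lines are dropped.
            if hit:
                out.append(rewrite(line))
        elif action == "CHANGE":
            # Matching lines are rewritten, all others pass through unchanged.
            out.append(rewrite(line) if hit else line)
        elif action == "DELETE":
            # Matching lines are removed, all others pass through unchanged.
            if not hit:
                out.append(line)
        else:
            raise ValueError("unknown action: " + action)
    return out


lines = ["foo.c", "readme", "bar.c"]
is_c = lambda s: s.lower().endswith(".c")      # stands in for MATCH {#?}.c
to_o = lambda s: s[:-2] + ".o"                 # stands in for the rule {1}.o

print(process(lines, is_c, "REPLACE", to_o))   # ['foo.o', 'bar.o']
print(process(lines, is_c, "CHANGE", to_o))    # ['foo.o', 'readme', 'bar.o']
print(process(lines, is_c, "DELETE"))          # ['readme']
```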
______________________________________________________________________________
Pattern specification:
In the following, the syntax of the patterns is specified. By good
tradition, this is done incomprehensibly.
First comes a "quick and dirty" overview of the available
patterns, a quick reference which might give you an impression of
the possibilities. It is all but sufficient to work with SED. Then, a
detailed and more precise, but also more confusing presentation
follows.
______________________________________________________________________________
Quick guide to patterns:
SED patterns work much like Amiga patterns. Unlike in "Search", a pattern is
applied to a FULL line, and not to sub-strings of this line. This means that
the pattern "hello" matches ONLY a line consisting of the single word "hello",
nothing more. If you want to match lines containing "hello", use "#?hello#?"
instead - see below for what "#?" means. Unlike Un*x sed, there are no special
characters to match the start or the end of a line. They are not required in
the SED approach.
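For readers more used to Un*x-style regular expressions, the full-line rule
can be pictured with Python's re module. This is only an analogy under the
assumption that the Python regex "hello" stands in for the SED pattern hello
and ".*hello.*" for #?hello#?; the helper name sed_like_match is made up.

```python
import re

def sed_like_match(regex, line, usecase=False):
    # The pattern must cover the FULL line, hence fullmatch() and not search().
    # Case is ignored unless usecase is set, mirroring the USECASE switch.
    flags = 0 if usecase else re.IGNORECASE
    return re.fullmatch(regex, line, flags) is not None

print(sed_like_match(r"hello", "hello"))                 # True
print(sed_like_match(r"hello", "say hello"))             # False
print(sed_like_match(r".*hello.*", "say hello"))         # True
print(sed_like_match(r"hello", "HELLO"))                 # True
print(sed_like_match(r"hello", "HELLO", usecase=True))   # False
```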
Standard patterns:
? Matches a single arbitrary character.
# Matches zero or more repetitions of the following symbol in
the AmigaOs sense. Note that # may match zero(!) characters
as well.
Therefore, #? matches an arbitrary sequence of at least zero
characters, hence any string.
+ Matches one or more repetitions of the following symbol.
New to AmigaOs, but standard in Un*x regular expressions.
* Matches zero or more arbitrary characters in the sense of
MS-DOS. Note that this is a functional difference to Un*x
regular expressions where * has the meaning of #.
Note that you need to write ** instead of * if you enclose
the pattern in double quotes on the shell command line. This
is because * is also the BCPL escape character. Messy.
(...) Groups the characters in the bracket into a single symbol.
For example, #(ab) would match an arbitrary repetition
of "ab", such as the empty string, "ab", "abab" or "ababab", but
not "aba".
Brackets can be nested.
(..|..) The vertical bar means "or". Matches either the left or the
right string. The bar is only valid within brackets.
{...} Groups expressions much like (...) but captures the contents
of the sub string that matched the brackets. This captured
expression is then available for the output replacement rules,
see below for more information. For example,
SED MATCH {#?}.c REPLACE {1}.o
would match all lines ending in ".c", and would capture the
string in front of the ".c". The "{1}" in the replace pattern
would insert this string, and would append an ".o".
In other words, the above replaces all lines ending in ".c" by a
similar line ending in ".o".
Works very much like the Un*x \(..\) matching.
{..|..} The vertical bar works exactly the same way here as described
above. Matches either the left or the right expression, and
captures the expression that fits.
% Matches the empty string. Useful for patterns like
"#?(.c|.o|%)" which could be used to match the source, the
object code and the final executable of a C project, for
example.
~ Means "not" and matches all symbols that do not match the
following symbol. Be warned, ~ is full of traps, see below
for the full description.
[..] Character classes. Matches a single character from a range of
valid characters specified in the interior of the bracket.
For example, "[ac]" would match the single character "a" or
"c".
[..|..] Matches either the left or the right character range. Hence,
[a|c] is equivalent to [ac].
[..,..] Another equivalent formulation of the above. [a,c] is the same
as [ac] or [a|c].
[..-..] Matches a character range. [a-z] matches all letters - except
language specific "Umlaute", though, which have different
encodings. Several ranges can be grouped much the same way as
single characters. [a-z|0-9] means "any letter or any digit"
but nothing else.
[-..] Matches all characters up to the specified character. Hence,
[-z] means "all characters up to z". Note that unlike in Un*x
implementations, there are no messy rules concerning the "["
itself as character. The escape character "\" must be used to
specify "[" or "]" itself, see below.
This syntax can be combined freely with "|" or "," to specify
more than one range.
[..-] Matches all characters starting at the given ASCII value. Can
be combined freely with "," and "|". There are no messy rules
concerning "-" in the middle or the end of a character range,
proper escaping must be used if "]" should be matched.
[a-] matches therefore all characters "a" and up.
[~..] Matches all characters not in the following range. ~ is applied
up to the next "|" or ",", unlike in the standard AmigaOs (Arp)
expression matching. Therefore,
[~ab] matches all characters except "a" and "b" and is
equivalent to [~a,~b] and [~a|~b].
\ Escape character. Specifies a character to be matched:
\t Tabulator \v Vertical TAB
\b Backspace \r CR
\f Form Feed \a Bell
\n is INVALID since the end of the line is matched
by the end of the pattern itself.
\x.. The character encoded by the hex value following
the "x". In case this specification is ambiguous,
the number might be terminated by a dot ".".
Hence, "\x9.0" matches a tabulator sign and the
digit "0", whereas "\x90" matches the ASCII
character of the code hex 90.
Note that this rule differs from the ANSI-C rule.
\0.. The character of the ASCII code encoded as an
octal number.
The dot is used as above as separator, unlike in
ANSI-C.
\$.. Identical to \x.., matches the character encoded by
the ASCII code in hex.
\d Matches the dollar sign since \$ has a different
meaning already.
\#.. Matches the character encoded by the ASCII code
in decimal notation.
\h Matches the hash-mark since \# has a different
meaning already.
Everything else: The character following the backslash
itself. Especially, \\ is the backslash itself and \" is
the double quote.
Note that you must use the backslash to match characters
which are otherwise part of the pattern syntax, as for
example "\(" to match the bracket. Note that "#" and "$"
are special in this sense since "\$" and "\#" are used
to specify characters by ASCII code.
!,",§,$,&,=
-,^,',`,<,> are reserved for future use AND MUST NOT be used at all.
Escape them if you need them. However, the dot (".") is
free, unlike Un*x regexp, same goes for "@" and "/".
.. Everything else: Matches the character itself. Hence "a"
matches a single "a" much like "[a]".
______________________________________________________________________________
Replacement rules:
The arguments of REPLACE and CHANGE specify what to do with the lines which
matched the specified pattern. Unlike the pattern specification, only the
special operators \ and {..} are allowed. All other operators from the
above list are forbidden and generate an error.
\ Escape character, works identically to the \ in the pattern
and places the single character encoded by the sequence
following the backslash on the output directly.
{..} Specifies a captured expression to be inserted into the
output stream. The brackets take up to three arguments:
the index of the captured expression, and optionally
two arguments specifying how to format it, separated
by a dot ".". These numbers work very much the same way as
the arguments to the %s format specifier in ANSI-C.
The first number in the bracket describes which captured
expression to insert. If it is a positive number, the number
is simply the index of the captured expression, counting from
one upwards.
Each opening bracket "{" in the input pattern starts a new
captured expression, hence in nested expressions the
outermost bracket has the lowest index.
If this number is negative, it counts the captured
expressions downwards from the last expression.
If the specified expression does not exist, the brackets
expand into an empty string that is formatted according to
the rules given by the next two arguments.
{1} is the first captured expression,
{3} is the third expression,
{-1} is the last expression,
{-2} is the second to last expression.
The next number is the field width to print the captured
expression in. At least the specified number of characters are
printed, or more if the expression is longer. If the
expression is too short, the field is padded with blank spaces.
The expression is right-justified into this field, unless
the field width is negative in which case the expression is
left-justified. The sign of the field width is otherwise
ignored.
Defaults to 0, i.e. the field is always as small as possible.
The last number is the size limit of the expression. The
expression will be cut down if it is longer than the specified
limit. SED will cut the end of the string if this argument is
positive, or the start of the string if it is negative. The
sign of the limit is otherwise ignored.
If the limit is 0, which is the default, the expression will
not be cut down at all.
{1.10} is the first captured expression right justified in
a field of ten characters or longer.
{2.-5.7} is the second captured expression, left justified in
a field of five characters. At most seven characters
of the expression will be printed.
.. Everything else: The character itself is printed on the output.
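The three numbers in {..} can be modelled with a few lines of Python. This is
a sketch of the rules as described above, not SED's actual code. The function
name expand is made up, and cutting the expression down before padding it is
an assumption borrowed from how %s behaves in ANSI-C.

```python
def expand(captures, index, width=0, limit=0):
    """Model of {index.width.limit} (illustration, not SED's code)."""
    # Positive indices count from 1, negative ones from the last capture.
    if index > 0:
        pos = index - 1
    else:
        pos = len(captures) + index
    # A capture that does not exist expands to the empty string.
    text = captures[pos] if 0 <= pos < len(captures) else ""
    # Size limit: positive cuts the end, negative cuts the start, 0 = no cut.
    # Assumption: the cut happens before padding, as with %s in C.
    if limit > 0:
        text = text[:limit]
    elif limit < 0:
        text = text[limit:]
    # Field width: right-justified, or left-justified if negative.
    if width >= 0:
        return text.rjust(width)
    return text.ljust(-width)


captures = ["main", "hello_world"]
print(repr(expand(captures, 1, 10)))      # '      main'
print(repr(expand(captures, 2, -5, 7)))   # 'hello_w'
print(repr(expand(captures, -1)))         # 'hello_world'
print(repr(expand(captures, 2, 0, -5)))   # 'world'
print(repr(expand(captures, 3)))          # ''
```

The first two calls correspond to {1.10} and {2.-5.7} from the text.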
______________________________________________________________________________
Detailed pattern matching rules:
And now for the detailed rules to confuse you completely:
- A SYMBOL is either a single character, one of the following operators
followed by its arguments, a character class [..] or a (..) or {..} sequence.
- A PATTERN is a sequence of SYMBOLs.
- The POSTFIX of a symbol in a pattern is the subsequence of the pattern
following the symbol, not including the argument of the symbol itself.
A POSTFIX is always a PATTERN itself.
? Matches a single character except the end of a string.
# Matches as many repetitions of the following SYMBOL as possible,
but at least zero, such that the POSTFIX of the symbol matches
the remaining input.
Hence, "#" is greedy. There is currently no non-greedy form.
+ Matches as many repetitions of the following SYMBOL as possible,
but at least one, such that the POSTFIX of the SYMBOL matches the
remaining input. "+" is greedy.
* is fully equivalent to "#?" and therefore greedy.
(...) groups the PATTERN up to the next | or ) into a SYMBOL
which matches if the contents of the brackets match.
(..|..) An or-combined SYMBOL matches if one of the PATTERNS
in the bracket matches such that the POSTFIX matches the
remaining input.
{...},{..|..} Similar to the above except that the matched string is
captured.
% Is completely ignored as pattern and gobbles no character
from the input sequence at all.
~ Matches the longest subsequence of at least zero characters
that does not match the following SYMBOL such that the
POSTFIX of the SYMBOL still matches the remaining input.
"~" is greedy and will try to match as many characters as possible first.
Note that a SYMBOL could either be a single character or a
sequence of characters grouped by () or #. Since a single
character cannot match a string larger or smaller than one
character, ~ followed by a one-character symbol will match
all subsequences except those whose postfix either doesn't
match the postfix of the character, or which match the
character and the postfix.
This is *very* tricky and you should think about the
consequences of this rule twice. More examples below.
[..] Character classes. Groups a range of characters into a
SYMBOL that matches exactly a single character, but never
the empty string.
~ in character classes is special: If there are not-sequences
in a character class, it matches if all not-sequences match at
once and one or more of the ordinary sequences match. Hence
[~p,~q,a-z]
matches all letters except p and q.
.. Everything else matches exactly the one character that
it represents. They will not match the empty string.
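The rule for ~ inside character classes can be written down as a small
Python sketch. It is an illustration only: the function name class_matches
and the representation of a class as two lists of ranges are made up, case
handling is left out, and the treatment of classes without any ordinary
sequence follows the [~ab] example from the quick guide.

```python
def class_matches(ch, ordinary, negated):
    """Model of a class such as [~p,~q,a-z] (illustration only).

    ordinary -- list of (low, high) ranges written without ~
    negated  -- list of (low, high) ranges written with ~
    """
    # Every not-sequence must hold at once ...
    if any(low <= ch <= high for low, high in negated):
        return False
    # ... and one or more ordinary sequences must match. A class with no
    # ordinary sequences, such as [~ab], accepts whatever is left
    # (assumption based on the quick guide).
    if not ordinary:
        return True
    return any(low <= ch <= high for low, high in ordinary)


# [~p,~q,a-z]
ordinary = [("a", "z")]
negated = [("p", "p"), ("q", "q")]
print(class_matches("a", ordinary, negated))   # True
print(class_matches("p", ordinary, negated))   # False
print(class_matches("5", ordinary, negated))   # False
```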
______________________________________________________________________________
Some examples of patterns to think of:
% Matches only empty lines in the input.
~% Matches only non-empty lines in the input.
#?.c Matches all lines ending in ".c"
The#? Matches all lines starting with "The".
#?Example#? Matches all lines containing the word "Example".
Example#? Matches all lines starting with the word "Example".
Example Matches all lines consisting of the single word "Example".
#? Matches all lines.
#?(.c|.o|%) Matches all lines (think about why!).
foo(.c|.o|%) Matches all lines consisting entirely of the word "foo", "foo.c"
or "foo.o".
foo(.c|.o|) Just the same.
~(Example) Matches all lines except the line consisting of the single word
"Example".
~(#?Example#?) Matches all lines that do not contain the word "Example".
~(ab)cd Matches all lines that do not start with "ab" and that end in
"cd". Especially, this would match "bccd". It would also match
the line "cd" since "ab" does not match the empty sequence in
front of "cd". (think about this!)
~#a.c Matches all lines ending in ".c" except those where the ".c" is
prefixed by an arbitrary number of a's, including zero a's.
Hence, it would match "bc.c" and even "ab.c", but not "a.c" or
".c" as the last consists of zero a's and one ".c". It would
not match "bc.o". This is identical to ~(#a).c since # binds
the following a.
~(#ab)#? Matches all lines except those starting with a possibly empty
sequence of a's followed by a single b. Hence, does not match
aaabccc or bccc.
~(#[ ,\t];)#? Matches all lines except those starting with a possibly empty
sequence of blanks or tabulators followed by a semicolon. Hence,
for a shell script, this would match all non-comment lines.
~(#[ ,\t];)if#? This is a tricky one. Unlike what you might think, this does
not match all non-comment lines starting with if. It also
matches lines starting with a semicolon provided the string
"if" is in the line and not directly behind the semicolon.
Note that this is the intended behaviour. For example, it
would match
;aifb
The reason is simple: This is a string ";a" that does not
match the symbol #[ ,\t]; followed by "ifb" which matches
if#?.
What you want here instead is #[ ,\t]if#? which matches
all if-lines with additional, at least zero, blanks or tabs
in front.
The above example shows again the tricky nature of pattern
matching.
A real life example would be
sed from S:Startup-Sequence match "{#[ ,\t]}RunBack{#?}" change "{1}Launch{2}"
which would replace all invocations of "RunBack" in the Startup-Sequence by
similar invocations of "Launch".
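For comparison, the effect of this command on a single line can be imitated
in Python. This is an analogy and not SED's implementation: the regex
([ \t]*) is assumed to stand in for {#[ ,\t]} and (.*) for {#?}, and the
function name change_runback as well as the sample lines are made up.

```python
import re

def change_runback(line):
    # The whole line must match, and case is ignored as SED does by default.
    m = re.fullmatch(r"([ \t]*)RunBack(.*)", line, re.IGNORECASE)
    if m is None:
        return line            # CHANGE leaves non-matching lines untouched
    # Equivalent of the replace rule {1}Launch{2}.
    return m.group(1) + "Launch" + m.group(2)

print(repr(change_runback("  RunBack C:ConClip")))   # '  Launch C:ConClip'
print(repr(change_runback("C:SetPatch QUIET")))      # 'C:SetPatch QUIET'
```

Since the pattern applies to the full line, only lines in which "RunBack"
follows the leading blanks or tabs directly are changed.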
Another example to think about as exercise:
({}|{#[~;]+[ \t]}#[~; \t][/:]|{}#[~; \t][/:]|{#[~;]+[ \t]})FooBar{|[ \t;]#?}
Yes, this pattern is useful. Consider again S:Startup-Sequence as input file
and think about what this could possibly do. Note that some expressions are
captured. (Hey, I said this would look like line noise!)
______________________________________________________________________________
Thomas Richter,
December 2000